The overall goal of this project was to see whether I could accurately show COVID-19 trends throughout the pandemic using the data that the CDC provides. More specifically, I wanted a deeper dive into which states were affected the most, as well as how each variant affected the overall population of the United States. I wanted to accomplish this through accurate visual representations of the CDC's data. The data is grouped by state and runs from January 22nd, 2020, to January 28th, 2022. The main parts that I kept from the original data frame were total cases, new cases per day, total deaths, new deaths per day, submission dates, and the 50 states plus the American territories.
This project was important to me because the pandemic has affected not only my life but the lives of billions across the world. It has affected both my personal life and my professional career. Since the pandemic began, our lives have changed. We all had to adapt to working remotely, learning remotely, and being careful about who we see, when we see them, and how we see them. For many of us, this still plays a role in our lives today, and it could even affect our lifestyle for the rest of our lives. Some of us may never return to working fully in person.
For me personally, I was not able to see my mother for nearly two whole years because of the pandemic. She is considered a high-risk individual due to prior health problems. Given that I lived no more than an hour away from her, this was unfortunate. My undergraduate program switched completely to online, I was not able to attend any labs, and I even had a very unusual graduation ceremony. My going-out habits and my views on the world have also been affected. When the pandemic first started, I constantly kept up with news outlets to find the latest on COVID-19 and how it was affecting day-to-day life throughout the country. Eventually, after being on lockdown for so long, I just got used to staying indoors and doing my part, and I no longer kept up with the trends and current events of COVID-19. This is why I wanted to take a look at the data and map it out for myself: to see whether I could view the pandemic's story as a whole over the past two years. I wanted to see if I could identify when certain variants were present, how they affected us as a country, and other trends throughout.
My first hypothesis was that states with more lenient mask mandates, like Florida and Texas, would have the highest numbers of cases and deaths and would have suffered the most throughout the pandemic (measured by the ratio of deaths to cases). My second hypothesis was that I would be able to clearly see when the original first few variants, the Delta variant, and the Omicron variant were present in the United States. A hunch I had while going through the data was that California and New York would also rank high in number of cases, deaths, and deaths per case.
I found my data on the CDC's website:
https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36
The data can be exported in the top right, and viewed at the bottom of the page.
Cleaning the data was not as straightforward as I thought it would be. In fact, I found myself cleaning the data over multiple weeks in order to obtain what I needed for each EDA phase.
For starters the original data frame had a few columns that I wanted to drop such as: conf_cases, prob_cases, pnew_case, conf_death, prob_death, pnew_death, created_at, consent_cases, and consent_deaths.
This left me with six columns: submission_date, state, tot_cases, new_case, tot_death, and new_death.
I then had to remove all commas in my df so that I could group my data and organize it as I would like.
This was followed up by converting my df objects to integers so that I could better work with them.
Next, since the state column was in abbreviations, such as 'CA' instead of California, I converted all of the states to full names, as I thought this would be more visually appealing while plotting and reading the data.
Throughout the EDA process I found that the American territories were serving as major outliers when plotting, so I decided it would be best to drop them in order to keep the plots and project focused on the main 50 states.
I also created a data frame grouped by state for the last day of my recorded data, January 28th, 2022. This was used to plot all of the total values from my data.
Lastly, and admittedly the one I struggled with the most, the CDC reports NYC as a separate entry from NY state. This was very confusing, and I eventually had to combine NYC and NY into one state because it was skewing my plotting results.
I eventually added a new column called percentage_of_deaths_by_cases, which is the ratio of total deaths to total cases for each state, expressed as a percentage. This was used to help identify which states suffered the most.
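The NYC/NY step described above can be illustrated on a tiny example. This is only a sketch: the counts in `toy` are made up, and only the jurisdiction codes mirror the CDC file. Once both codes map to the same name, a grouped sum merges them into one row per date.

```python
import pandas as pd

# Toy rows with made-up counts; only the codes NY, NYC and CA mirror the CDC file.
toy = pd.DataFrame({
    "submission_date": ["1/28/22", "1/28/22", "1/28/22"],
    "state": ["NY", "NYC", "CA"],
    "tot_cases": [100, 40, 250],
    "tot_death": [10, 4, 20],
})

# Map both New York codes to the same full name.
name_map = {"NY": "New York", "NYC": "New York", "CA": "California"}
toy["state"] = toy["state"].map(name_map)

# Summing within each state and date folds the two New York rows into one.
merged = (
    toy.groupby(["state", "submission_date"], as_index=False)[["tot_cases", "tot_death"]]
       .sum()
)
print(merged)
```

The notebook's own cleaning code follows the same idea: it renames both codes to "New York" and relies on the later `groupby(...).sum()` calls to combine them.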
submission_date: Date the data was submitted to the CDC; each jurisdiction should report daily.
state: The name of the state.
tot_cases: Cumulative number of cases for that state up to and including that day.
new_case: Number of new COVID-19 cases reported for that state on that submission date.
tot_death: Cumulative number of deaths for that state up to and including that day.
new_death: Number of new deaths reported for that state on that submission date.
percentage_of_deaths_by_cases: Total deaths divided by total cases, times 100, for each state on the final submission date of January 28th, 2022.
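Since tot_cases and tot_death are described as running totals of new_case and new_death, one way to sanity-check that relationship is to compare a per-state cumulative sum against the reported total. The sketch below uses a hypothetical helper, `cumulative_mismatches`, and made-up toy numbers; it is not part of the original analysis. In real data a mismatch may simply mean a jurisdiction revised earlier counts, so flagged rows are a prompt to look closer rather than proof of an error.

```python
import pandas as pd

def cumulative_mismatches(df, daily_col, total_col):
    """Return rows where the per-state running sum of daily_col differs from total_col."""
    ordered = df.sort_values(["state", "submission_date"])
    running = ordered.groupby("state")[daily_col].cumsum()
    return ordered[running != ordered[total_col]]

# Made-up toy data: state B's last total is off by one on purpose.
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B"],
    "submission_date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24"] * 2),
    "new_case": [1, 2, 3, 0, 5, 1],
    "tot_cases": [1, 3, 6, 0, 5, 7],
})

bad = cumulative_mismatches(toy, "new_case", "tot_cases")
print(bad)
```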
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
import plotly.express as px
#multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
#final data file:
df = pd.read_csv('Romzy_Safadi_covid19.csv')
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44280 entries, 0 to 44279
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   submission_date  44280 non-null  object
 1   state            44280 non-null  object
 2   tot_cases        44280 non-null  object
 3   conf_cases       23957 non-null  object
 4   prob_cases       23885 non-null  object
 5   new_case         44280 non-null  object
 6   pnew_case        40408 non-null  object
 7   tot_death        44280 non-null  object
 8   conf_death       23675 non-null  object
 9   prob_death       23675 non-null  object
 10  new_death        44280 non-null  object
 11  pnew_death       40305 non-null  object
 12  created_at       44280 non-null  object
 13  consent_cases    36895 non-null  object
 14  consent_deaths   37638 non-null  object
dtypes: object(15)
memory usage: 5.1+ MB
| | submission_date | state | tot_cases | conf_cases | prob_cases | new_case | pnew_case | tot_death | conf_death | prob_death | new_death | pnew_death | created_at | consent_cases | consent_deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/28/22 | GA | 2,346,518 | 1,824,347 | 522,171 | 18,785 | 2,900 | 32,868 | 27,502 | 5,366 | 150 | 23 | 1/29/22 14:30 | Agree | Agree |
| 1 | 1/28/22 | PR | 453,669 | 257,022 | 196,647 | 3,404 | 2,101 | 3,809 | 3,174 | 635 | 36 | 6 | 1/29/22 14:30 | Agree | Agree |
| 2 | 1/28/22 | FL | 5,501,599 | NaN | NaN | 22,705 | 5,753 | 64,647 | NaN | NaN | 7 | 1 | 1/29/22 14:30 | Not agree | Not agree |
| 3 | 1/28/22 | UT | 875,251 | 875,251 | 0 | 6,166 | 0 | 4,107 | 3,944 | 163 | 10 | -1 | 1/29/22 14:30 | Agree | Agree |
| 4 | 1/28/22 | HI | 208,253 | NaN | NaN | 1,848 | 207 | 1,152 | NaN | NaN | 5 | 0 | 1/29/22 14:30 | Not agree | Not agree |
#dropping columns to create df3
df3 = df.drop( columns = ['conf_cases', 'prob_cases', 'pnew_case', 'conf_death', 'prob_death', 'pnew_death',
'created_at', 'consent_cases', 'consent_deaths'])
#removing commas and converting the count columns from objects to integers
for col in ['tot_cases', 'new_case', 'tot_death', 'new_death']:
    df3[col] = pd.to_numeric(df3[col].str.replace(',', ''))
# converting state abbreviations to full names (ex: CA --> California)
# NYC is reported separately from NY, so both map to "New York" and are summed later
df3["state"] = df3["state"].replace({ "AL":"Alabama",
"AK":"Alaska",
"AZ":"Arizona",
"AR":"Arkansas",
"CA":"California",
"CO":"Colorado",
"CT":"Connecticut",
"DE":"Delaware",
"FL":"Florida",
"GA":"Georgia",
"HI":"Hawaii",
"ID":"Idaho",
"IL":"Illinois",
"IN":"Indiana",
"IA":"Iowa",
"KS":"Kansas",
"KY":"Kentucky",
"LA":"Louisiana",
"ME":"Maine",
"MD":"Maryland",
"MA":"Massachusetts",
"MI":"Michigan",
"MN":"Minnesota",
"MS":"Mississippi",
"MO":"Missouri",
"MT":"Montana",
"NE":"Nebraska",
"NV":"Nevada",
"NH":"New Hampshire",
"NJ":"New Jersey",
"NM":"New Mexico",
"NYC":"New York",
"NY":"New York",
"NC":"North Carolina",
"ND":"North Dakota",
"OH":"Ohio",
"OK":"Oklahoma",
"OR":"Oregon",
"PA":"Pennsylvania",
"RI":"Rhode Island",
"SC":"South Carolina",
"SD":"South Dakota",
"TN":"Tennessee",
"TX":"Texas",
"UT":"Utah",
"VT":"Vermont",
"VA":"Virginia",
"WA":"Washington",
"WV":"West Virginia",
"WI":"Wisconsin",
"WY":"Wyoming",
"DC":"District of Columbia",
"AS":"American Samoa",
"GU":"Guam",
"MP":"Northern Mariana Islands",
"PR":"Puerto Rico",
"UM":"United States Minor Outlying Islands",
"VI":"U.S. Virgin Islands",
"RMI":"Marshall Islands",
"FSM":"Federated States of Micronesia",
"PW":"Palau"
})
#dropping the territories to keep the 50 states (plus the District of Columbia)
territories = ["American Samoa", "Guam", "Northern Mariana Islands", "Puerto Rico",
"United States Minor Outlying Islands", "U.S. Virgin Islands", "Marshall Islands",
"Federated States of Micronesia", "Palau"]
df3 = df3[~df3["state"].isin(territories)]
#convert dates to datetime64 type
df3['submission_date'] = pd.to_datetime(df3['submission_date'])
df3
df3.info()
#create df of final date, to gather total numbers by end -->
#true tot_cases/tot_death on final date
df4 = df3[df3['submission_date'] == '2022-01-28']
df4.head()
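#sum only the numeric columns; the datetime column cannot be summed in recent pandas versions
newdf = df4.groupby('state').sum(numeric_only = True).reset_index()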
newdf.head()
#Calculate percentage of deaths from total cases and total deaths ((total deaths/total cases)*100)
df5 = newdf.copy()
df5["percentage_of_deaths_by_cases"] = (df5['tot_death'] / df5['tot_cases'] * 100).round(3)
df5.head()
| | state | tot_cases | new_case | tot_death | new_death | percentage_of_deaths_by_cases |
|---|---|---|---|---|---|---|
| 0 | Alabama | 1206308 | 10748 | 17086 | 39 | 1.416 |
| 1 | Alaska | 205241 | 5689 | 1052 | 4 | 0.513 |
| 2 | Arizona | 1829406 | 15610 | 26001 | 69 | 1.421 |
| 3 | Arkansas | 768061 | 5660 | 9616 | 20 | 1.252 |
| 4 | California | 8213786 | 76729 | 78825 | 254 | 0.960 |
#grouping df3 by submission date and state
dfgrouped = df3.groupby(['state', 'submission_date']).sum().reset_index()
dfgrouped.head()
| | state | submission_date | tot_cases | new_case | tot_death | new_death |
|---|---|---|---|---|---|---|
| 0 | Alabama | 2020-01-22 | 0 | 0 | 0 | 0 |
| 1 | Alabama | 2020-01-23 | 0 | 0 | 0 | 0 |
| 2 | Alabama | 2020-01-24 | 0 | 0 | 0 | 0 |
| 3 | Alabama | 2020-01-25 | 0 | 0 | 0 | 0 |
| 4 | Alabama | 2020-01-26 | 0 | 0 | 0 | 0 |
#select a few states so we can plot a scatter of just those states:
#the top 5 states by total cases and the top 5 states by ratio of deaths to cases
states = ['California','New York','Florida','Pennsylvania','Texas',
'Mississippi', 'New Jersey', 'Michigan', 'Connecticut', 'Arizona']
states_few = dfgrouped[dfgrouped['state'].isin(states)]
states_few
| | state | submission_date | tot_cases | new_case | tot_death | new_death |
|---|---|---|---|---|---|---|
| 1476 | Arizona | 2020-01-22 | 0 | 0 | 0 | 0 |
| 1477 | Arizona | 2020-01-23 | 0 | 0 | 0 | 0 |
| 1478 | Arizona | 2020-01-24 | 0 | 0 | 0 | 0 |
| 1479 | Arizona | 2020-01-25 | 0 | 0 | 0 | 0 |
| 1480 | Arizona | 2020-01-26 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 32467 | Texas | 2022-01-24 | 5973164 | 38924 | 76904 | 29 |
| 32468 | Texas | 2022-01-25 | 6018220 | 45056 | 77058 | 154 |
| 32469 | Texas | 2022-01-26 | 6048954 | 30734 | 77321 | 263 |
| 32470 | Texas | 2022-01-27 | 6083750 | 34796 | 77555 | 234 |
| 32471 | Texas | 2022-01-28 | 6122432 | 38682 | 77780 | 225 |
7380 rows × 6 columns
#group by submission date to show nationwide cases per day
#(numeric columns only, so the text 'state' column is left out of the sum)
df6 = df3.groupby(by = 'submission_date').sum(numeric_only = True)
df6.index.name = 'Date'
df6 = df6.reset_index()
df6
| | Date | tot_cases | new_case | tot_death | new_death |
|---|---|---|---|---|---|
| 0 | 2020-01-22 | 0 | 0 | 0 | 0 |
| 1 | 2020-01-23 | 1 | 1 | 0 | 0 |
| 2 | 2020-01-24 | 2 | 1 | 0 | 0 |
| 3 | 2020-01-25 | 2 | 0 | 0 | 0 |
| 4 | 2020-01-26 | 3 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 733 | 2022-01-24 | 71391336 | 1077704 | 863638 | 2482 |
| 734 | 2022-01-25 | 71877826 | 486490 | 866751 | 3113 |
| 735 | 2022-01-26 | 72439170 | 561344 | 869829 | 2810 |
| 736 | 2022-01-27 | 73010190 | 571020 | 872453 | 2624 |
| 737 | 2022-01-28 | 73531094 | 520904 | 875755 | 3181 |
738 rows × 5 columns
#ratio of new deaths/new cases * 100
df7 = df6.copy()
df7["percentage_of_new_deaths_by_new_cases"] = (df6['new_death'] / df6['new_case'] * 100).round(3)
df7
| | Date | tot_cases | new_case | tot_death | new_death | percentage_of_new_deaths_by_new_cases |
|---|---|---|---|---|---|---|
| 0 | 2020-01-22 | 0 | 0 | 0 | 0 | NaN |
| 1 | 2020-01-23 | 1 | 1 | 0 | 0 | 0.000 |
| 2 | 2020-01-24 | 2 | 1 | 0 | 0 | 0.000 |
| 3 | 2020-01-25 | 2 | 0 | 0 | 0 | NaN |
| 4 | 2020-01-26 | 3 | 1 | 0 | 0 | 0.000 |
| ... | ... | ... | ... | ... | ... | ... |
| 733 | 2022-01-24 | 71391336 | 1077704 | 863638 | 2482 | 0.230 |
| 734 | 2022-01-25 | 71877826 | 486490 | 866751 | 3113 | 0.640 |
| 735 | 2022-01-26 | 72439170 | 561344 | 869829 | 2810 | 0.501 |
| 736 | 2022-01-27 | 73010190 | 571020 | 872453 | 2624 | 0.460 |
| 737 | 2022-01-28 | 73531094 | 520904 | 875755 | 3181 | 0.611 |
738 rows × 6 columns
#descending states by total death, to show which state had the most, and least,
#cumulative deaths by January 28, 2022.
tot_death_desc = newdf.sort_values(by = 'tot_death', ascending = False)
bar_desc_tot_death = px.bar(tot_death_desc, x = 'state', y = 'tot_death',title = "Total Deaths by State, Descending",
labels = dict(tot_death = "Total Deaths",
state = "State"))
bar_desc_tot_death.show()
Placing these in descending order was important because we can now see that California, Texas, Florida, New York, and Pennsylvania had the most deaths due to COVID-19 over the past two years, while Vermont, Alaska, and Hawaii had the fewest.
# Most cumulative cases by state, as of January 28, 2022
# total cases by state, descending barplot
tot_cases_desc = newdf.sort_values(by = 'tot_cases', ascending = False)
bar_tot_cases_desc = px.bar(tot_cases_desc, x = 'state', y = 'tot_cases',title = "Total Cases by State",
labels = dict(tot_cases = "Total Cases",
state = "State"))
bar_tot_cases_desc.show()
This is significant because it shows how many cases each state had over the pandemic. We can see that California, Texas, Florida, New York, and Pennsylvania are all at the top of the plot, with the highest numbers of cases. This led me to believe that there is a correlation between total cases and total deaths.
#checking correlations (numeric columns only, since 'state' is text)
newdf.select_dtypes('number').corr().style.background_gradient(cmap='coolwarm')
#create correlation heat map of total deaths and total cases per state
plt.figure(figsize=(4,4))
seaborn.heatmap(newdf.select_dtypes('number').corr(), annot = True, cmap = 'coolwarm')
| | tot_cases | new_case | tot_death | new_death |
|---|---|---|---|---|
| tot_cases | 1.000000 | 0.871048 | 0.976014 | 0.526586 |
| new_case | 0.871048 | 1.000000 | 0.787194 | 0.464684 |
| tot_death | 0.976014 | 0.787194 | 1.000000 | 0.538976 |
| new_death | 0.526586 | 0.464684 | 0.538976 | 1.000000 |
The heat map and correlation table show a strong correlation (about 0.98) between total deaths and total cases.
Although there is a correlation between total deaths and total cases, we cannot say there is direct causation, because large states such as California simply have far more people than states like Vermont. A better way to understand the overall picture is to compute total deaths as a percentage of total cases for each state, to see which states really did end up "suffering" the most.
#plot in descending order to see which state suffered more in relation
#to their number of cases.
relation_death_to_case_per_state = df5.sort_values(by = 'percentage_of_deaths_by_cases',
ascending = False)
bar_death_per_case_desc = px.bar(relation_death_to_case_per_state,
x = 'state', y = 'percentage_of_deaths_by_cases',
title = "State vs (Total Death/Total Cases)",
labels = dict(percentage_of_deaths_by_cases = "Percentage of total deaths by total cases",
state = "State"))
bar_death_per_case_desc.show()
#descending order table; by percentage of deaths
relation_death_to_case_per_state
| | state | tot_cases | new_case | tot_death | new_death | percentage_of_deaths_by_cases |
|---|---|---|---|---|---|---|
| 38 | Pennsylvania | 2637717 | 15583 | 40394 | 137 | 1.531 |
| 24 | Mississippi | 717666 | 5533 | 10831 | 25 | 1.509 |
| 30 | New Jersey | 2102227 | 10118 | 31320 | 112 | 1.490 |
| 22 | Michigan | 2235180 | 14999 | 32197 | 25 | 1.440 |
| 6 | Connecticut | 696070 | 2684 | 9985 | 77 | 1.434 |
| 2 | Arizona | 1829406 | 15610 | 26001 | 69 | 1.421 |
| 0 | Alabama | 1206308 | 10748 | 17086 | 39 | 1.416 |
| 18 | Louisiana | 1105273 | 6483 | 15631 | 61 | 1.414 |
| 20 | Maryland | 949880 | 3011 | 13387 | 53 | 1.409 |
| 10 | Georgia | 2346518 | 18785 | 32868 | 150 | 1.401 |
| 28 | Nevada | 648088 | 4393 | 8914 | 39 | 1.375 |
| 21 | Massachusetts | 1598451 | 8149 | 21909 | 70 | 1.371 |
| 31 | New Mexico | 470513 | 5269 | 6417 | 26 | 1.364 |
| 32 | New York | 4764000 | 8558 | 63921 | 85 | 1.342 |
| 14 | Indiana | 1604072 | 17067 | 21301 | 136 | 1.328 |
| 48 | West Virginia | 438889 | 4668 | 5743 | 46 | 1.309 |
| 25 | Missouri | 1314435 | 9106 | 17111 | 9 | 1.302 |
| 35 | Ohio | 2562412 | 9440 | 33071 | 582 | 1.291 |
| 43 | Texas | 6122432 | 38682 | 77780 | 225 | 1.270 |
| 26 | Montana | 238801 | 2834 | 2993 | 3 | 1.253 |
| 3 | Arkansas | 768061 | 5660 | 9616 | 20 | 1.252 |
| 36 | Oklahoma | 963655 | 10539 | 12044 | 0 | 1.250 |
| 42 | Tennessee | 1844780 | 17692 | 22452 | 73 | 1.217 |
| 15 | Iowa | 712288 | 8922 | 8501 | 0 | 1.193 |
| 13 | Illinois | 2897174 | 15453 | 34439 | 141 | 1.189 |
| 9 | Florida | 5501599 | 22705 | 64647 | 7 | 1.175 |
| 12 | Idaho | 376095 | 3653 | 4400 | 35 | 1.170 |
| 41 | South Dakota | 225383 | 1145 | 2637 | 9 | 1.170 |
| 40 | South Carolina | 1349276 | 10892 | 15266 | 85 | 1.131 |
| 17 | Kentucky | 1140887 | 15706 | 12890 | 34 | 1.130 |
| 50 | Wyoming | 144526 | 1397 | 1625 | 0 | 1.124 |
| 46 | Virginia | 1535349 | 9743 | 16168 | 41 | 1.053 |
| 16 | Kansas | 722824 | 12986 | 7522 | 134 | 1.041 |
| 7 | Delaware | 246037 | 1307 | 2498 | 4 | 1.015 |
| 19 | Maine | 174217 | 1266 | 1737 | 4 | 0.997 |
| 8 | District of Columbia | 129817 | 0 | 1284 | 0 | 0.989 |
| 37 | Oregon | 620652 | 7431 | 6086 | 19 | 0.981 |
| 39 | Rhode Island | 341407 | 1836 | 3302 | 14 | 0.967 |
| 4 | California | 8213786 | 76729 | 78825 | 254 | 0.960 |
| 34 | North Dakota | 221025 | 2065 | 2093 | 3 | 0.947 |
| 5 | Colorado | 1240361 | 7083 | 11061 | 56 | 0.892 |
| 23 | Minnesota | 1309665 | 14548 | 11532 | 43 | 0.881 |
| 33 | North Carolina | 2374866 | 22631 | 20595 | 78 | 0.867 |
| 27 | Nebraska | 435358 | 0 | 3666 | 0 | 0.842 |
| 47 | Washington | 1294498 | 13483 | 10699 | 52 | 0.826 |
| 49 | Wisconsin | 1503420 | 8180 | 12291 | 75 | 0.818 |
| 29 | New Hampshire | 272492 | 2429 | 2205 | 12 | 0.809 |
| 11 | Hawaii | 208253 | 1848 | 1152 | 5 | 0.553 |
| 45 | Vermont | 94513 | 0 | 503 | 0 | 0.532 |
| 1 | Alaska | 205241 | 5689 | 1052 | 4 | 0.513 |
| 44 | Utah | 875251 | 6166 | 4107 | 10 | 0.469 |
We can now clearly see that Pennsylvania suffered the most, with 1.531% of its total cases resulting in death. We also know from the prior plots and tables that Pennsylvania was among the top five states in both total cases and total deaths. Furthermore, states like Mississippi, New Jersey, and Michigan, although not among the highest in cases and deaths, were still among the hardest hit, since they were in the top five for the ratio of deaths to cases.
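As a quick worked check of the ratio, the sketch below recomputes the percentage for three states using the total deaths and total cases from the table above. The helper name `deaths_per_100_cases` is only illustrative and is not part of the original notebook.

```python
def deaths_per_100_cases(total_deaths, total_cases):
    """Total deaths as a percentage of total cases, rounded to 3 decimals."""
    return round(total_deaths / total_cases * 100, 3)

# (tot_death, tot_cases) on 2022-01-28, copied from the table above.
snapshot = {
    "Pennsylvania": (40394, 2637717),
    "California": (78825, 8213786),
    "Utah": (4107, 875251),
}

for state, (deaths, cases) in snapshot.items():
    print(state, deaths_per_100_cases(deaths, cases))
```

The results line up with the percentage_of_deaths_by_cases column: Pennsylvania is at the top of the ranking and Utah at the bottom, even though California has far more deaths in absolute terms.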
#show a few states at once, date vs total deaths scatter:
#covers the top 5 states by total cases and the top 5 states by ratio of deaths to cases
scatter_few_states = px.scatter(states_few,
x = 'submission_date', y = 'tot_death', color = 'state',
title = "Date vs Total Deaths Few States",
labels = dict(submission_date = "Date",
tot_death = "Total Deaths"))
scatter_few_states.show()
This was significant because it shows the trends over time for the states with the most deaths and the highest death-to-case ratios, as well as how their totals increased. We can see that for California, Texas, Florida, and New York, total deaths rose sharply toward the end of 2021, while all of the states saw a jump at the start of 2021.
#new cases on daily basis
new_cases_daily = px.area(df6, x = 'Date', y = 'new_case',
title = "New Cases on Daily Basis",
labels = dict(new_case = "New Cases"))
new_cases_daily.show()
This was important because we can now see the overall story of the pandemic within the United States over the past two years. The hill around Jan 2021 appears to correspond to the earlier lineages (the original strain, followed by the Alpha and Beta variants), the hill between Jul 2021 and Oct 2021 to the Delta variant, and the large peak near Jan 2022 to the Omicron variant. We can also have a hunch that Omicron was the most contagious, but we cannot conclude that from this data and graph alone. To do that, we would need to know exactly which variant each person tested positive for.
Further information on when each variant was introduced to the United States can be found here:
https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/
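Daily counts swing a lot from one reporting day to the next (for example, 1,077,704 new cases on 2022-01-24 versus 486,490 the next day in the table above), which can make the waves harder to read. One common way to smooth this is a rolling average. The sketch below is an optional addition, not something the original analysis did: `add_rolling_mean` is a hypothetical helper, the toy numbers are made up, and it assumes a frame shaped like df6 with a `Date` column.

```python
import pandas as pd

def add_rolling_mean(df, column, window=7):
    """Return a copy of df with a trailing rolling mean of `column` added."""
    out = df.sort_values("Date").copy()
    out[f"{column}_{window}d_avg"] = (
        out[column].rolling(window, min_periods=1).mean()
    )
    return out

# Made-up toy series with two one-day reporting spikes.
toy = pd.DataFrame({
    "Date": pd.date_range("2022-01-01", periods=10, freq="D"),
    "new_case": [100, 100, 400, 100, 100, 100, 100, 100, 100, 400],
})

smoothed = add_rolling_mean(toy, "new_case", window=3)
print(smoothed)
```

With the real data, something like `add_rolling_mean(df6, "new_case")` would give a 7-day average column that could be passed to `px.area` in place of `new_case`.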
#new deaths daily, to show the severity of the different variants
new_deaths_daily = px.area(df6, x = 'Date', y = 'new_death',
title = "New Deaths on Daily Basis",
labels = dict(new_death = "New Deaths"))
new_deaths_daily.show()
Here it looks like the Delta variant caused the most deaths. Yet we cannot say it is the most deadly: the Alpha and Beta variants earlier on were not far off, and there was no vaccine back then, whereas by the time of the Delta variant a good number of high-risk people had been vaccinated. In addition, it would be hard to determine which variant is the most deadly from this data alone, as we would need more information. Furthermore, the Omicron variant produced significantly more cases, yet that period did not have the most deaths. To get a better look at which variants were the most lethal, we can look at the percentage of new deaths per new cases.
#graph of ratio of new deaths per new cases
new_deaths_per_cases_daily = px.area(df7, x = 'Date', y = 'percentage_of_new_deaths_by_new_cases',
title = "Ratio of New Deaths to New Cases",
labels = dict(percentage_of_new_deaths_by_new_cases = "Percentage of New Deaths per New Cases"))
new_deaths_per_cases_daily.show()
Looking at this plot, we can see that the Alpha and Beta variants actually killed the most people relative to the number of cases we were having within the USA. Although it is tempting to say that these variants were the most deadly because they caused the most deaths in proportion to cases, we cannot conclude that from this data alone. We can only treat it as a hunch or theory, since these results could be due to a multitude of variables, such as limited testing early in the pandemic. Furthermore, we can see that the ratio of deaths during the Omicron period was very low, at a time when testing was abundant and more people were vaccinated.
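The daily ratio is undefined on days with zero new cases (the NaN rows in the table above) and is noisy when counts are small. A possible refinement, sketched here under the assumption of a frame shaped like df6, is to take the ratio of rolling sums instead of single days. The helper `rolling_death_ratio` and the toy numbers are illustrative only and were not part of the original analysis.

```python
import pandas as pd

def rolling_death_ratio(df, window=7):
    """Percentage of deaths per case over a trailing window of days."""
    out = df.sort_values("Date").copy()
    cases = out["new_case"].rolling(window, min_periods=window).sum()
    deaths = out["new_death"].rolling(window, min_periods=window).sum()
    # Windows with zero cases have no meaningful ratio, so leave them as NaN.
    out["rolling_pct_deaths_per_cases"] = (
        deaths / cases.where(cases > 0) * 100
    ).round(3)
    return out

# Made-up toy counts, including leading days with no cases at all.
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-22", periods=6, freq="D"),
    "new_case": [0, 0, 0, 10, 30, 60],
    "new_death": [0, 0, 0, 1, 0, 2],
})

result = rolling_death_ratio(toy, window=3)
print(result)
```

This does not address the other caveats mentioned above, such as limited early testing; it only reduces day-to-day noise in the ratio.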
The purpose of this project was to show a timeline of how each state was affected throughout the pandemic. This was done by visualizing and analyzing the total deaths, total cases, new deaths, new cases, and dates, on both a cumulative and a daily basis, for the fifty states of the USA.
My hypothesis was that states with looser mask mandates, such as Florida and Texas, were going to be among the states that suffered the most. Initially, when looking at the total number of cases and deaths, this seemed to be the case. However, when looking at the ratio of deaths to cases for each state, we can see that Florida and Texas are not among the top five states by that measure. Shockingly, the state that stood out was Pennsylvania, as it was among the top five in total cases and total deaths, and number one in the ratio of deaths to cases. At the same time, we saw that California, even though it had a high number of cases and deaths, was in the bottom quarter of states by that ratio. These contradictory results could be due to a number of different factors. It could be that the stricter mask mandate in California was helping; yet Pennsylvania, which also had a relatively strict mask mandate, was at the top of the list. I believe there is one key red flag that stood out to me the most upon completing this project: the fact that submitting this data to the CDC was voluntary. This meant that each state could have biases that skewed its data before it was submitted to the CDC. It also meant states could report however, and as much as, they wanted.
My second hypothesis/hunch was that this data could visually represent the story of the pandemic within the USA. I expected that, looking at the data, one would be able to tell when each variant was entering the United States and how we reacted to the variants as a whole. This is evident in the data: we can see when we first started experiencing the Alpha and Beta variants at the end of 2020, with the Delta variant coming later in 2021, just before the Omicron variant. We can even see the surge of cases at the beginning of the pandemic.
Overall, I was able to take the data and visualize the trends of COVID-19 throughout the pandemic, along with the variants and the spikes that came with them. Furthermore, based on the CDC's voluntary data, the ratio of total deaths to total cases did not support my theory that states with looser mask mandates would have suffered the most. As the plots indicate, Pennsylvania is actually the state that suffered the most. Finally, although it is difficult to reach a decisive conclusion, the ratio of deaths per cases helped theorize that the Delta variant could have been the most deadly COVID-19 variant within the USA to date.
Possible next steps would be to compare the CDC's data with data from other institutions, such as Johns Hopkins University, to see how accurate the voluntary data is. Furthermore, we could take this one step further and obtain which variant each individual tested positive for (if possible).